Abstract

Depth maps obtained from commercially available structured-light
stereo based depth cameras, such as the Kinect, are easy to use but
are affected by significant amounts of noise. This paper studies the
intrinsic noise characteristics of such depth maps. In particular, we
show that the standard deviation of noise in the estimated depth varies
quadratically with the distance of the object from the depth camera. We
validate this theoretical model against empirical observations and
demonstrate the utility of this noise model in three popular
applications: depth map denoising, volumetric scan merging for 3D
modeling, and identification of 3D planes in depth maps.


1 Introduction 

3D scanning is used extensively for many computer vision applications, e.g.
human-computer interface (HCI), virtual reality, game programming,
industrial monitoring, archeology, etc. Although applications such as HCI or
gaming generally demand speed rather than precision, the accuracy of the 3D
reconstruction is of crucial importance for many other tasks including
archeology or quality monitoring in industrial production. The recent commercial
availability of inexpensive structured-light depth cameras has opened up new
possibilities for 3D scanning and shape reconstruction. Such devices were
originally intended for human pose estimation in a gaming context but an
extensive body of research by the vision community has demonstrated their
effective use in 3D shape reconstruction [20], [33], [50], [48], [39], [35], [?], [42], [52], [53].
While the low cost, ease of use and availability of depth maps at video
frame rate make such depth cameras very attractive, these devices suffer
from a high level of noise in the raw depth maps, which needs to be addressed
before such depth cameras can be used for 3D scanning or reconstruction.
Therefore, it is crucial to understand the accuracy and noise
characteristics of structured-light depth cameras and develop appropriate methods to
mitigate the effects of such noise on the final 3D reconstructions or other
representations. In the remainder of this section we briefly survey the
current literature in this regard. In Section 2 we briefly discuss the working
principle of a structured-light stereo based depth camera and show how the
quadratic nature of depth noise is inherent to the working principle of such
depth cameras. We also validate our claim with thorough experimentation.
In Section 3 we show the effectiveness of this noise modeling in depth map
denoising, volumetric scan merging and plane extraction. Finally, we provide
some conclusions at the end of the paper.

The accuracy and noise characteristics of structured-light depth cameras
like the Kinect have been investigated in recent years [?].
A good description of the working principle of the Kinect is available in
[?]. In addition, several studies have been performed to denoise the
depth maps obtained from such devices. We may categorize these approaches
into ones that make no explicit attempt to characterize the depth noise
[?] and those that study the behavior of
noise in the depth images [?]. Amongst the methods of the first
category that do not model depth noise, a number of applications, including
Kinect Fusion [33], use bilateral smoothing of depth maps in their pipelines.
Other works utilize the available RGB images captured simultaneously with
the depth maps [?]. In [?], the authors exploit the assumption that
edges in the depth map and the RGB image should occur together. Similar ideas are
also exploited in [?]. In [?], photometry is used
to refine depth maps. However, the methods of [33], [31], [6] and [?] do not recognize the
underlying characteristic of noise inherent to structured-light stereo based
depth cameras. Therefore, some of them rely on additional cues from RGB
images to denoise depth maps.

In contrast, there are some other approaches that seek to understand the
nature of noise present in depth cameras. The work in [?] uses extensive empirical
measurements to model the noise characteristic of depth sensors. The authors argue
that accounting for the noise in this fashion significantly improves both
reconstruction and tracking in the Kinect Fusion pipeline. Like [?], the work
in [?] also empirically notes that the noise in Kinect depth maps has a
quadratic relationship to depth. However, neither of these studies recognizes
the fact that the quadratic nature of the standard deviation of depth noise
is inherent to the working principle of structured-light stereo based depth
cameras. In [22], [23], the working principle of structured-light stereo cameras
was utilized to develop a model for the noise present in depth maps.
However, no attempt was made in [22], [23] to incorporate the noise model into
any application method or estimation framework.
In our work, we both derive a noise model for structured-light depth
cameras and demonstrate its utility in a variety of applications. In an earlier
conference paper [7] we independently derived a theoretical noise model
similar to the one presented in [22], [23]. In Section 2 we show that the quadratic
nature of depth noise is inherent to the working principle of
structured-light stereo depth cameras. This theoretical observation is also validated
through experimental observation. Furthermore, unlike the work of [22], [23],
we demonstrate in Section 3 how our theoretical understanding of depth map
noise can be carefully exploited in a variety of applications such as depth map
denoising, volumetric scan merging and 3D planar surface extraction. Before
developing our noise model, we would like to comment on the nature of the
present work. There exists a large body of literature on the ingredients of
3D reconstruction pipelines using depth cameras [20], [33], [50], [48], [39], [35], [?], [42],
[52], [53]. Although we also incorporate our noise model into a reconstruction
pipeline to generate the 3D reconstruction results, this paper is not about
scan registration or 3D reconstruction pipelines. Rather we emphasise the
development of a noise model for depth cameras and demonstrate the value
in using this noise model in various applications.

2 A noise model for depth cameras 

In this Section, we briefly state the working principles of structured-light 
depth cameras and then theoretically derive a noise model for such depth 
cameras. We also provide empirical validation of the noise model that we 
develop. We use the Kinect as our depth camera for all the experiments in 
this section, although our ideas can also be applied to any other structured- 
light depth camera. The Kinect depth camera is a structured-light stereo 
system that consists of an infra-red (IR) projector and an infra-red camera. 
The projector is an IR laser operating at a wavelength of around 830nm 
that is placed behind a diffraction grating. This diffraction grating converts 
the single beam of the laser source into a fixed dot pattern. This pattern is 
projected on the 3D scene in front of the device and is reflected back to the IR 
camera of the device. Together the IR projector and camera act as a stereo 
pair. It may be noted here that a projector is geometrically equivalent to a 
pin-hole camera except that the direction of light is reversed. In addition, we 
recognise that the projector is equivalent to a bundle of rays with a common 
point. Therefore, in terms of projective geometry, we can associate a virtual 
projector plane where each projector pixel is a member of an equivalence 
class defined by a unique IR ray passing through it, i.e. projector pixels 
are the inhomogeneous representations of the corresponding IR rays. The
reader is referred to [?] for details on the projective equivalence of pixels on
a camera plane and a bundle of rays with a common point. From the above
discussion we note that, for our geometrical analysis, we can treat the Kinect
as a pair of stereo cameras in canonical position.


2.1 Noise Model 


By correlating patches of the viewed IR camera image with a known emitted pattern
(i.e. the known virtual projector image), one can obtain the disparity estimates
in the IR camera plane [28], [?]. The depth (Z) or distance of an object
is inversely proportional to this disparity (D), i.e. $D = fB/Z$, where B is the
baseline distance between the optical centers of the camera and the projector
and f is the focal length of the camera. It has been observed that the Kinect
is precise enough to need no stereo rectification. Similarly, non-linear lens
distortions are negligible.

While factors such as sensor noise, the IR component of ambient light,
reflectance of objects etc. cause inaccuracies in the disparity measurement
obtained with patch correlation, the most significant source of disparity error is
quantization noise. Such quantization noise arises when we estimate
disparity with a given finite precision, i.e. disparity estimates are only allowed to
take on a finite set of values. Therefore, our estimate of disparity can be modeled
as being corrupted with quantization noise with a fixed standard deviation.
This constant noise level applies to all disparity values irrespective of the
true disparity or, correspondingly, the actual distance of objects. Let the
true disparity be $D_0$ and let the observation model for the estimated disparity
be $D = D_0 + n$, where $n$ is an additive noise with fixed standard deviation $\sigma_D$.
It will be noted that the dominant component of $n$ is quantization noise.
Therefore, to first order in $n$, we have the following relationship

$$ Z = \frac{fB}{D_0 + n} \approx \frac{fB}{D_0} - \frac{fB}{D_0^2}\, n = Z_0 - \frac{Z_0^2}{fB}\, n, \qquad \sigma_Z = \frac{Z_0^2}{fB}\, \sigma_D \qquad (1) $$

where $Z_0 = fB/D_0$ is the true depth.
Thus the standard deviation of noise in depth measurement is proportional
to the square of the depth of objects. This implies that the precision of
estimating depth using a stereo projector-camera pair falls off according to an
inverse square law. Equation (1) can be equivalently derived by differentiating
disparity with respect to depth as follows.


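$$ \frac{\partial D}{\partial Z} = -\frac{fB}{Z^2} \quad \Rightarrow \quad \frac{\partial Z}{\partial D} = -\frac{Z^2}{fB} \qquad (2) $$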

i.e. the noise or uncertainty in disparity estimation is amplified by a factor
proportional to the square of depth when we convert disparities into depth
estimates. To see the implications of this quadratic relationship we can
consider the following comparison. The baseline distance between the projector
and camera in the Kinect is B = 75 mm [?]. The focal length of the IR
camera is found to be 587 pixels in our calibration. This focal length
translates to a field-of-view of about 57° which is consistent with the available
manufacturer specifications [?]. Now, if we consider two depth values of,
say, 600 mm and 1500 mm (i.e. 2 feet and 5 feet approximately), we have the
depth sensitivity values of

$$ \left.\frac{\partial Z}{\partial D}\right|_{Z = 600\,\mathrm{mm}} = -\frac{600^2}{75 \times 587} \approx -8.2 \ \mathrm{mm/pixel} $$

$$ \left.\frac{\partial Z}{\partial D}\right|_{Z = 1500\,\mathrm{mm}} = -\frac{1500^2}{75 \times 587} \approx -51.1 \ \mathrm{mm/pixel} $$

i.e., an error of 1 pixel in disparity estimation translates to a depth error of
8.2 mm and 51.1 mm at 2 and 5 feet respectively.

Figure 1: Scan of a floor plane with Kinect and a zoomed in view that shows
the quantization noise in it. Panels: (a) scan of plane; (b) zoomed view.
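The two sensitivity values quoted above can be reproduced with a few lines of code. The following is a minimal sketch, assuming the values B = 75 mm and f = 587 pixels stated in the text; the function names `depth_sensitivity` and `depth_noise_std` are ours and purely illustrative.

```python
def depth_sensitivity(z_mm, baseline_mm=75.0, focal_px=587.0):
    """Magnitude of dZ/dD in mm per pixel of disparity error: Z^2 / (f B)."""
    return z_mm ** 2 / (focal_px * baseline_mm)


def depth_noise_std(z_mm, disparity_std_px, baseline_mm=75.0, focal_px=587.0):
    """First-order depth noise std for a given (fixed) disparity noise std."""
    return depth_sensitivity(z_mm, baseline_mm, focal_px) * disparity_std_px


if __name__ == "__main__":
    for z in (600.0, 1500.0):
        print(f"Z = {z:6.0f} mm -> {depth_sensitivity(z):5.1f} mm/pixel")
```

Since the sensitivity scales with the square of depth, moving an object from 600 mm to 1500 mm (a factor of 2.5 in distance) increases the depth error per pixel of disparity error by a factor of 6.25.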

In Figure 1(a) we show a 3D scan of a floor obtained using a Kinect. In
Figure 1(b) we show a zoomed-in view of a small region of this scan. As is
evident, the Kinect scan is extremely noisy due to the quantization effect in
disparity estimation.


2.2 Empirical Validation 

We now provide empirical validation of this theoretical observation. To
understand the behavior of such a depth estimate, we extract all unique depth
values from the scan of the floor in Figure 1. In Figure 2(a) we plot these
unique values in ascending order. We may note that for our purposes, we
could have carried out this experiment using any object surface as long as
the observed object contains a continuous range of depth values. A 3D plane
such as a floor is eminently suitable for this purpose. The rate of change
between the unique values of Z provides us information on the resolution of
the depth estimate. Specifically, we define the quantity $\Delta Z$ as the
difference between adjacent values in the sorted sequence of unique depth values,
i.e. $\{\Delta Z\}_k = \{Z\}_k - \{Z\}_{k-1}$, where $\{Z\}_k$ is the k-th term in the sorted
sequence of unique depth values. If the resolution of the depth estimate were
uniform for all Z, then the plot of unique values of Z would be a
straight line, i.e. the plot of $\{Z\}_k$ vs. k would have a constant slope, or
equivalently $\Delta Z$ would be constant. However, from Figure 2(a) it is clear that as
depth (Z) increases the resolution of Z becomes poorer. To understand the
nature of the changing depth resolution, in Figure 2(b), we plot the function
$\log(\Delta Z)$ vs. $\log(Z)$. The straight line fitted through the observed points
has an estimated slope of 1.967, i.e. a slope almost equal to 2, thereby
empirically verifying our theoretical argument in this paper that the depth
sensitivity is proportional to $Z^2$. In other words, for the same perturbation
error in disparity measurement, we get a much larger perturbation of the depth
estimate when Z is larger. We parenthetically remark here that the
block-like ringing of $\log(\Delta Z)$ is due to the finite precision of the depth estimation
algorithm implemented on the Kinect. Recall that in any disparity measurement
in a structured-light stereo configuration, we first estimate the best disparity
D. This solves for the disparity D with finite resolution, i.e. the estimated
D can only take on a discrete set of values specified in units of pixels.

Figure 2: Analysis of depth resolution of a structured-light stereo range
scanner (Kinect). Panels: (b) log-log plot of resolution vs. depth; (c) simulated
log-log plot of resolution vs. depth. Please see text for details.

Before proceeding further we need to establish the disparity resolution (in
pixels) for the Kinect. It will be noted that the stereo method for measuring
disparity would evaluate the stereo matching cost function at a finite
number of subpixel disparities. To establish this disparity resolution for the Kinect,
we conducted a simple experiment. From the Kinect depth map values of
the floor plane shown in Figure 1(a), we estimate the corresponding
disparity ($D = fB/Z$) at every pixel. Subsequently we sort these disparity values
in ascending order. In this sorted list, we notice that there are at most 8
disparity estimates between two integer disparities, i.e. the subpixel
resolution of disparity estimation in the Kinect is 1/8 pixel. When the distance
is large, i.e. disparity is small, there are exactly 8 disparity values between
two subsequent integers. When the disparity is high, there are fewer than 8
disparity values since the output depth is also quantized to a precision of
1 mm. Hence, for large disparity (or small depth) two or more consecutive
unique disparity values are rounded off to the same depth estimate due to
quantization.
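The counting experiment described above can be sketched in code. The example below is an illustration only: it has no access to the authors' floor scan, so the hypothetical helper `simulated_depths` stands in for sensor output by assuming 1/8-pixel disparities and 1 mm depths. The names `disparity_from_depth` and `levels_between` are likewise ours.

```python
import math

F_PX = 587.0  # focal length in pixels (value quoted in the text)
B_MM = 75.0   # baseline in mm (value quoted in the text)


def disparity_from_depth(z_mm):
    """Invert Z = f B / D to recover the disparity (pixels) behind a depth."""
    return F_PX * B_MM / z_mm


def levels_between(depths_mm, d_low, step=1.0 / 8.0):
    """Count distinct disparity levels in [d_low, d_low + 1) after snapping
    the recovered disparities to the nearest multiple of `step`."""
    levels = set()
    for z in depths_mm:
        d = disparity_from_depth(z)
        snapped = math.floor(d / step + 0.5) * step  # absorbs 1 mm rounding
        if d_low <= snapped < d_low + 1:
            levels.add(snapped)
    return len(levels)


def simulated_depths(z_min=500, z_max=3000, step=1.0 / 8.0):
    """Hypothetical stand-in for a scan: quantize disparity to `step`,
    convert back to depth and round the depth to 1 mm."""
    out = set()
    for z in range(z_min, z_max + 1):
        d = math.floor(disparity_from_depth(z) / step + 0.5) * step
        out.add(math.floor(F_PX * B_MM / d + 0.5))
    return sorted(out)


if __name__ == "__main__":
    depths = simulated_depths()
    print("levels in [15, 16):", levels_between(depths, 15))  # far range
    print("levels in [80, 81):", levels_between(depths, 80))  # near range
```

Under these assumptions the far range (small disparity) shows all 8 subpixel levels per integer disparity, while the near range shows fewer because several disparity levels collapse onto the same millimetre of depth.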

Naturally, when we estimate $Z = fB/D$, we can only get a finite number
of depth estimates, i.e. the estimated Z cannot be continuous valued. It will
also be noted that as Z increases, this 'ringing effect' seems to
diminish as a consequence of the non-linear nature of the log-log plot. To better
understand this phenomenon we perform the following synthetic experiment
that seeks to replicate the observations depicted in Figure 2(b). We take
a sequence of depth values from 50 cm to 3 m with a resolution of 1 mm.
We then estimate disparities for these depth values with the known baseline
distance and focal length of the Kinect, i.e. f = 587 pixels and B = 75 mm.
Estimated disparities are quantized at 1/8 pixel resolution as we have
established above that the Kinect's disparity resolution is 1/8 pixel. Then the
depths are estimated from these quantized disparities. Subsequently, depth
values are also quantized with a precision of 1 mm. Finally, unique values
of depths are extracted and $\log(\Delta Z)$ vs. $\log(Z)$ is plotted in Figure 2(c).
The corresponding Matlab code is given in Algorithm 1. The synthetic
plot in Figure 2(c) very closely resembles the empirical observation of
Figure 2(b), thereby validating our model for noise in depth estimation using
structured-light stereo scanners like the Kinect.


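Algorithm 1 Matlab code for simulation of noise-resolution relationship in
Kinect depth data. Figure 2(c) was generated using this code. See text for
details.


b = 75; f = 587;                                % baseline (mm), focal length (pixels)
z = b*f ./ (round(b*f ./ (500:3000) * 8) / 8);  % quantize disparity to 1/8 pixel, back to depth
z = unique(round(z));                           % quantize depth to 1 mm, keep unique values
plot(log(z(1:end-1)), log(diff(z)))             % log(Delta Z) vs. log(Z)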
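For readers without Matlab, the simulation of Algorithm 1 can be sketched in plain Python. This is a hedged port and not the authors' code: plotting is replaced by a least-squares slope estimate on the log-log data, and the `z_from` cut-off (which excludes the near range where 1 mm depth rounding dominates) is our own assumption.

```python
import math


def simulate_depth_levels(b=75.0, f=587.0, z_min=500, z_max=3000, subpixel=8):
    """Unique depth values (mm) after quantizing disparity to 1/subpixel
    pixel and depth to 1 mm, mirroring the steps of Algorithm 1."""
    levels = set()
    for z in range(z_min, z_max + 1):
        # floor(x + 0.5) rounds half up, like Matlab's round for positive x
        d = math.floor(b * f / z * subpixel + 0.5) / subpixel
        levels.add(math.floor(b * f / d + 0.5))
    return sorted(levels)


def loglog_slope(levels, z_from=0):
    """Least-squares slope of log(Delta Z) against log(Z) for Z >= z_from."""
    xs, ys = [], []
    for z0, z1 in zip(levels, levels[1:]):
        if z0 >= z_from:
            xs.append(math.log(z0))
            ys.append(math.log(z1 - z0))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    levels = simulate_depth_levels()
    print("slope for Z >= 1 m:", loglog_slope(levels, z_from=1000))
```

A slope close to 2 on this synthetic data is what the quadratic model predicts; the exact value depends on the fitted depth range.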
3 Applications

In Section 2 we have established the fact that the standard deviation of noise
in structured-light stereo based depth maps increases proportionally with the
squared distance of an object. In this Section we demonstrate the utilization
of this fact in three different applications, i.e. depth map denoising, weighted
volumetric scan merging using pixel-wise uncertainty, and plane extraction
using the disparity map. We have chosen these three applications as they
are amongst the most common approaches of using Kinect depth maps for
3D scene representation and understanding.

3.1 Smoothing of Depth Images 

The enhancement of depth maps is an important problem by itself. Recent
works in depth map enhancement include [?].
Of these, the methods of [?] use a guidance color image for
smoothing the depth map using the consistency between a depth map and
an aligned color image. The use of a guidance image enables these methods
to perform hole-filling near the object boundaries. Since the focus of our
paper is the noise characteristic of depth maps, we do not use any additional
information apart from the depth map itself. Therefore, we do not perform
any hole-filling in this paper. In [?], temporal fluctuations of the depth
map are used for enhancing the depth estimate. In this paper we emphasize
using the information available in a single depth map to denoise it. In [?],
different methods of filtering a depth map are compared, while [?] discusses the
application of depth map enhancement for 3D tele-communication. Apart
from aesthetic reasons, depth map denoising is a crucial prerequisite for scan
registration and other downstream processing in 3D reconstruction. Take, for
instance, the well-known Iterative Closest Point (ICP) algorithm which relies on
point-to-plane [29] distance measurements to register 3D scans in a common
frame of reference. Such an ICP approach relies on estimating the normal
vector at a point on a 3D surface. Normal vectors computed from a scan
obtained from a raw depth map are extremely noisy and unreliable. This
is so because normal estimation involves discrete differentiation operations
which amplify the noise present in the depth maps. Therefore it is essential
to smooth the scan representation for ICP to work effectively. Since it is
generally cumbersome to smooth 3D surface representations, such smoothing
can be equivalently carried out directly on the depth maps. Indeed, methods
such as [33] rely on smoothing the observed depth maps in a preprocessing
step.

As in image denoising, depth maps can also be efficiently denoised or
smoothed using a neighborhood kernel. Such a local smoothing operation
relies on the fact that nearby pixels should have similar intensity (or
equivalently depth) values. While filters typically weight the influence of pixels
according to their distance from the central pixel, e.g. a Gaussian
smoothing filter, we often need to account for intensity or depth discontinuities,
i.e. the violation of the assumption that neighboring pixels have similar
values. A commonly used modification is the bilateral filter [45] which modifies
the weighting to account for variation of intensity, thereby effectively
carrying out a robust smoothing operation. The standard bilateral filter applied
to depth images implicitly assumes that depth values have uniform
uncertainty. In the following subsection, we demonstrate how we can incorporate
our understanding of depth map noise into the bilateral filter to significantly
improve its effectiveness in depth map denoising. This approach is called
Adaptive Bilateral Filtering.

3.2 Adaptive Bilateral Filtering 

Consider an observed depth map $Z(p)$ where $p = (x, y)$ denotes the location
of a pixel. The standard approach of bilateral filtering [45] gives us a denoised
depth estimate $\hat{Z}(p)$ as

$$ \hat{Z}(p) = \frac{1}{W} \sum_{q \in \mathcal{N}(p)} w_s(q - p)\, w_d(Z(q) - Z(p))\, Z(q) \qquad (3) $$

where $w_s$ and $w_d$ are Gaussian functions for spatial and range weighting with
standard deviations of $\sigma_s$ and $\sigma_d$ respectively, $\mathcal{N}(p)$ is the neighborhood of
$p$ and $W$ is an overall normalizing factor so that the weights sum to 1 over $\mathcal{N}(p)$.
In other words,

$$ w_s(\mathbf{x}) \propto e^{-\|\mathbf{x}\|^2 / 2\sigma_s^2} \qquad (4) $$

$$ w_d(y) \propto e^{-y^2 / 2\sigma_d^2} \qquad (5) $$

In Equation (3), along with spatial smoothing, the range weight $w_d$ explicitly
accounts for the depth differences between the central pixel and other pixels
in the support of the smoothing kernel. As a result, using a bilateral filter
effectively reduces the influence of neighbors of p that have greatly different
values, i.e. that violate the implicit assumption that pixels within the support
of the smoothing kernel have similar values. Consequently, a bilateral filtered
depth map is smoothed but also preserves depth edges. Although the
bilateral filter is an effective strategy, it is inadequate to deal with the problem
of smoothing depth maps from structured-light stereo cameras. Recall that
the choice of a fixed $\sigma_d$ in the bilateral filter means that we use a fixed 'soft
threshold' to distinguish between similar neighboring pixels and depth
discontinuities. However, as argued in Section 2, the noise level $\sigma_d$ for depth maps is not fixed
but varies quadratically with depth. Moreover, since the projector's bundle
of rays is divergent, the surface sampling density is lower for objects that
are far from the scanner. From these two observations, it is clear that for
distant objects, the uncertainty of their depth estimate is itself large. In
such a scenario, using a fixed $\sigma_d$ is undesirable. For instance, a small $\sigma_d$
means that distant objects are not adequately denoised. Conversely, a large
$\sigma_d$ leads to oversmoothing of surfaces that are closer to the depth sensor.
In Figure 3(a) we show the mesh representation of a depth image of a planar
surface, i.e. the floor of a corridor in our building. We have rotated the mesh
representation into an upright view such that the closer part of the floor is
at the bottom and the distant part of the floor is at the top of Figure 3(a),
i.e. depth increases from bottom to top. In Figure 3(b), we show the result of
applying the standard bilateral filter of equation (3) to it. While for a given $\sigma_d$
the standard bilateral filter in Figure 3(b) works well for points on the floor
that are close to the scanner, we can see that for points that are further away
(i.e. the upper part of Figure 3(a)) smoothing by the standard bilateral filter
is inadequate. Since our analysis shows that the depth estimate sensitivity
is quadratically proportional to the depth itself, we modify $\sigma_d$ to vary as $Z^2$,
i.e. $\sigma_d = kZ^2$ where $k$ is a constant. As a result, our modified Adaptive
Bilateral filter is the same as that of equation (3) with the modification that
instead of $\sigma_d$ being a constant, for each pixel $p$, we have

$$ \sigma_d(p) = k\, Z(p)^2 \qquad (6) $$

In Figure 3(c) we show the result of applying our adaptive bilateral filter
to smooth the raw depth map shown in Figure 3(a). In contrast to the
standard bilateral filter, our adaptive filtering shown in Figure 3(c) gives
superior results for all depths since the smoothing in the range kernel $w_d$
takes into account the specific generative model for the depth image.

Figure 3: (a) shows the noisy raw scan of a planar surface (corridor floor). (b)
shows the result of applying the standard bilateral filter to this scan. Notice
that while the lower region, which is closer to the scanner, is smoothed,
the upper portion, which is far away, is not adequately smoothed. (c) This
problem is mitigated by the use of our adaptive bilateral filter. Panels: (a) raw
scan; (b) bilateral filtered; (c) adaptive bilateral filtered.
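The adaptive variant differs from the standard bilateral filter only in how the range standard deviation is chosen. The sketch below sets it to k times the squared depth of the central pixel. It is an illustration under our own assumptions rather than the authors' implementation: the value of `k` is left to the caller, the neighborhood is a square window of half-width `radius`, and non-positive depths are treated as holes that are neither filled nor used.

```python
import math


def adaptive_bilateral_filter(depth, radius, sigma_s, k):
    """Bilateral filter whose range std is sigma_d(p) = k * Z(p)^2."""
    rows, cols = len(depth), len(depth[0])
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            zp = depth[y][x]
            if zp <= 0:
                continue  # assumed hole marker: left unfilled
            sigma_d = k * zp * zp  # depth-dependent range std
            num = 0.0
            den = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    qy, qx = y + dy, x + dx
                    if not (0 <= qy < rows and 0 <= qx < cols):
                        continue
                    zq = depth[qy][qx]
                    if zq <= 0:
                        continue  # ignore holes in the neighborhood
                    w_s = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s ** 2))
                    w_d = math.exp(-((zq - zp) ** 2) / (2.0 * sigma_d ** 2))
                    num += w_s * w_d * zq
                    den += w_s * w_d
            out[y][x] = num / den
    return out
```

With k = 2e-5 per mm, for example, the range standard deviation is about 7 mm at 600 mm and 45 mm at 1500 mm, so fluctuations of a few tens of millimetres on a distant surface are smoothed while a jump of several hundred millimetres between foreground and background is still preserved.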

We provide an additional example of the superiority of our adaptive
bilateral filter in Figure 4 on a more natural depth map of an individual in
the foreground with a background consisting of a door and two walls at 90° to
each other. The original raw depth map is shown in Figure 4(a). Apart from
the depth discontinuities between the foreground figure and the background
wall, there are also some small holes in the depth map due to occlusions.
The result of Gaussian smoothing as shown in Figure 4(b) is obtained by
choosing a large enough standard deviation to effectively remove the noise in
the scan. However, as can be seen from the details shown in the zoomed-in
regions in Figure 4(e), this also results in the blurring of sharp depth edges.

Figure 4: (a) original noisy raw scan of a scene; (b) result of the standard Gaussian
filter; (c) result of the standard bilateral filter; (d) result of our Adaptive Bilateral
Filter; (e) comparison of the filtering performance of the different methods
(Gaussian, bilateral, adaptive bilateral) for three zoomed-in regions.

In contrast, the bilateral filter can preserve depth discontinuities (Figure
4(c)). However, since there are greatly varying depth values in this scene, it
cannot provide the appropriate level of smoothing for all parts of the scene.
For instance, we can see in Figure 4(e) that the level of smoothing chosen is
inadequate to denoise the back wall which has a higher noise variance. If we
had chosen a higher value of $\sigma_d$ here, we would end up oversmoothing the
depth map of the individual in the foreground. In contrast, our Adaptive
Bilateral Filter automatically tunes $\sigma_d$ to perform the desired smoothing for
any depth value. The result of applying our adaptive bilateral filter to the
depth image is shown in Figure 4(d). We can see that both the foreground
and background regions are appropriately smoothed while preserving depth
discontinuity features since our filter adaptively modifies the standard
deviation of the range dimension of the bilateral filter. This observation can also
be noted when we look at the different zoomed-in regions shown in Figure
4(e).

3.3 Volumetric Scan Merging 

Scan merging is an important component of any 3D object or scene
reconstruction pipeline. For object reconstruction, an object is scanned from
different directions. Each scan can view only a part of the object. These
scans are then registered or brought into a single frame of reference. In
practice, such registration is carried out by utilizing the common features
in overlapping regions present in multiple scans. As a consequence, once the
scans are registered, multiple estimates of the overlapping surface regions
are available. These multiple estimates need to be converted into a single
unified surface representing the reconstructed object or scene.

This process is called scan merging. Before further discussion on scan
merging, to make this article self-contained, we shall briefly discuss how
depth maps or range scans are converted into 3D surface representations.
Depth maps or range scans are sampled 2D representations of 3D surfaces.
The value assigned to a pixel at $(x, y)$ in a depth map represents the depth
(distance in the direction of the principal axis) $Z(x, y)$ of the 3D point which
is imaged at $(x, y)$. Let the focal length of the IR camera be $f$ and the
principal point be at $(u, v)$. Then, the depth value $Z(x, y)$ at the pixel
$(x, y)$ represents the 3D point $P = (X, Y, Z)^T$ such that

$$ X = \frac{(x - u)\, Z(x, y)}{f}, \qquad Y = \frac{(y - v)\, Z(x, y)}{f}, \qquad Z = Z(x, y) \qquad (7) $$

Thus, every pixel in a depth map with a valid depth value represents a 3D
point which obeys equation (7). The collection of such 3D points is often
called a point cloud and is a sampled description of the scanned 3D surface.
After converting depth maps into 3D scans, they are aligned or brought into
a single coordinate frame of reference. This process is known as registration
and is often carried out using the ICP algorithm and its variants [40].
Finally, multiple co-registered scans of a scene need to be converted or merged
into a single unified representation. For scan merging, one relatively simple
approach is that of zippering [?]. In zippering, overlapping parts of
scans are first eroded and then stitched along the common boundaries. In a final
consensus step, vertex positions are re-estimated using the original scans.
Another method known as volumetric scan merging [?] is often the method
of choice as it is quite effective under practical situations. In the
following, we present a brief overview of volumetric scan merging and readers are
referred to [?] for details.
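The conversion of a depth map into a point cloud described above amounts to one pinhole back-projection per pixel. A minimal sketch follows, assuming that a non-positive depth marks an invalid pixel (a convention we adopt here for illustration) and using the hypothetical function name `backproject`.

```python
def backproject(depth, f, u, v):
    """Convert a depth map (list of rows) into a list of 3D points.

    A pixel (x, y) with depth Z maps to ((x - u) Z / f, (y - v) Z / f, Z),
    where f is the focal length in pixels and (u, v) the principal point.
    """
    points = []
    for y, row in enumerate(depth):
        for x, z in enumerate(row):
            if z > 0:  # skip pixels without a valid depth
                points.append(((x - u) * z / f, (y - v) * z / f, float(z)))
    return points


if __name__ == "__main__":
    toy = [[0, 1000], [500, 0]]
    print(backproject(toy, f=500.0, u=0.0, v=0.0))
```

A pixel at the principal point always maps onto the optical axis, and the lateral extent covered by one pixel grows linearly with depth, which is the divergence of rays noted earlier.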

In volumetric scan merging, we consider a 3D volume that contains or
encapsulates all the co-registered scans. For implementational purposes, this
3D volume is discretized into voxels centered on grid points. To achieve a
seamless fusion of the multiple measurements in overlapping scan regions,
each scan is converted into a truncated-signed-distance-function (TSDF)
$f_i(X, Y, Z)$ where the index i denotes the i-th scan. The TSDF $f_i(X, Y, Z)$
is defined over the encapsulating volume and is computed at the discrete set
of 3D grid points (X, Y, Z) uniformly placed within the volume, i.e. at the
center of each voxel. The TSDF magnitude $|f_i(X, Y, Z)|$ at any grid point
(X, Y, Z) represents the distance of the nearest point on the i-th scan along
the corresponding line of sight. To limit the influence of distant surfaces,
these distance values are clipped beyond a maximum value, i.e. truncated.
The sign of the TSDF at a point represents whether the point is nearer or farther
from the camera compared to the surface.
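The truncation and sign behavior just described can be illustrated along a single line of sight. This is a sketch under assumptions of our own: distances are measured from the camera center, a positive value means the grid point lies behind the observed surface, and `f_min` and `f_max` are illustrative names for the clipping bounds.

```python
def tsdf_value(grid_dist, surface_dist, f_min, f_max):
    """Signed distance of a grid point from the surface along its line of
    sight, clipped to [f_min, f_max]."""
    return max(min(grid_dist - surface_dist, f_max), f_min)


def tsdf_along_ray(grid_dists, surface_dist, f_min, f_max):
    """TSDF samples for several grid points on the same line of sight."""
    return [tsdf_value(d, surface_dist, f_min, f_max) for d in grid_dists]


if __name__ == "__main__":
    # Surface observed at 1000 mm, truncation at +/- 30 mm.
    print(tsdf_along_ray([900, 990, 1000, 1010, 1100], 1000, -30, 30))
```

Only grid points within the truncation band around the surface carry graded values; all others saturate at the bounds, so a distant surface cannot influence the zero crossing of another one.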

Let us consider the i-th depth map. Suppose the rotation and the trans¬ 
lation for the i-th position of the scanner is given by (3 x 3 orthonormal 
real matrix with determinant 1) and (3 x 1 real vector) respectively. Let 
the calibration matrix of the scanner be (3 x 3 upper triangular matrix 
with K^(3,3) = 1). In our case, we assume that we use the same scanner 
throughout, i.e. = K0 Then the projection matrix for the i-th position 

^We parenthetically note here that IR cameras can be calibrated using the standard 
procedure of estimating homographies [5] by observing a checkerboard pattern illuminated 
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of the scanner is given by the 3x4 matrix 

Mi = Ki [ Ri I Ti ] (8) 

For a 3D point P = (X, y, Z), its image in the i-th depth map is projected 
onto the pixel where pi satisfies the relationship 
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where in equation]^ the equality is a projective relationship m- Therefore, 
if the depth value at pixel p^ = (xi^yi) in the i-th scan is Zi{xi^yi)^ we can 
identify it with a 3D point with the coordinates (with respect to i-th cameras 
local coordinate system) given by 


■ Vi ■ 


{xi-Ui)Zi{xi,yi) 

fi 
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{yi-Vi)Zi{xi,yi) 
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( 10 ) 


Clearly, denoting the 3D point as $P_i = (X_i, Y_i, Z_i)^T$, the distance of
the surface from the center of the camera in the i-th position along the line
of sight passing through the pixel $(x_i, y_i)$ is given by $\|P_i\|$. Also,
the distance of the grid point $P = (X, Y, Z)^T$ from the center of the camera
in the i-th position is given by $\|R_i P + T_i\|$ where $\|\cdot\|$ is the
2-norm. Therefore, the TSDF is given as

$$ f_i(X, Y, Z) = \max\left(\min\left(\|R_i P + T_i\| - \|P_i\|,\; f_{max}\right),\; f_{min}\right) \tag{11} $$


where $f_{min}$ and $f_{max}$ are the minimum and maximum values of TSDF’s,
beyond which TSDF’s are clipped. The TSDF’s $f_i(X, Y, Z)$ from all the
scans are then summed up with appropriate weights $W_i(X, Y, Z)$, leading to
a unified representation

$$ F(X, Y, Z) = \frac{\sum_i W_i(X, Y, Z)\, f_i(X, Y, Z)}{\sum_i W_i(X, Y, Z)} \tag{12} $$
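As an illustration of equations (9) to (11), the following is a minimal NumPy sketch of how the clipped signed distance at one grid point could be evaluated for a single depth map. It is not the authors' implementation: the function name `tsdf_value`, the default truncation limits, the nearest-pixel depth lookup and the use of a single focal length `K[0, 0]` for both image axes are all assumptions made for this example.

```python
import numpy as np

def tsdf_value(P, R, T, K, depth, f_min=-0.05, f_max=0.05):
    """Clipped signed distance f_i at grid point P for one depth map.

    P: (3,) grid point in world coordinates; R, T: pose of the i-th scan;
    K: 3x3 calibration matrix; depth: 2D array of depths (same unit as P).
    Returns None when P projects outside the image or onto an invalid pixel.
    """
    Pc = R @ P + T                       # grid point in the i-th camera frame
    if Pc[2] <= 0:
        return None                      # behind the camera
    p = K @ Pc                           # projective relationship, eq. (9)
    col = int(round(p[0] / p[2]))        # nearest pixel (an assumption)
    row = int(round(p[1] / p[2]))
    h, w = depth.shape
    if not (0 <= row < h and 0 <= col < w):
        return None
    Z = depth[row, col]
    if Z <= 0:
        return None                      # no valid depth reading
    f, u, v = K[0, 0], K[0, 2], K[1, 2]
    # Back-project the pixel to the observed surface point, eq. (10).
    Pi = np.array([(col - u) * Z / f, (row - v) * Z / f, Z])
    d = np.linalg.norm(Pc) - np.linalg.norm(Pi)
    return float(max(min(d, f_max), f_min))   # truncation, eq. (11)

# Hypothetical usage: a flat wall 1 m in front of an identity-pose camera.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
depth = np.ones((480, 640))
print(tsdf_value(np.array([0.0, 0.0, 0.98]), np.eye(3), np.zeros(3), K, depth))
```

With this sign convention a grid point in front of the observed surface receives a negative value and a point behind it a positive value, in line with the description of the TSDF sign given earlier.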


The zero crossing surface of $F(X, Y, Z)$ in the encapsulating volume is the
merged and unified representation of all the co-registered scans. This unified
representation is obtained by performing an iso-surface extraction which is
in practice implemented using a marching cubes algorithm [27], [32]. It will
be noted that the process of summing up of individual TSDF’s effectively

with IR radiation from an incandescent light source. Alternatively, the IR projector of the 
depth camera can be used as an IR source by placing a semi-transparent sheet of paper 
in front of the projector.

averages the position of surfaces with multiple observations. This helps in
denoising or reducing the uncertainty of vertex positions. However, to
correctly account for the information present in each observation, the
different measurement representations $f_i(X, Y, Z)$ should be weighted by an
appropriate function $W_i(X, Y, Z)$ that reflects its uncertainty. Consider
the scenario where a surface might be observed from two different distances.
As noted earlier, the standard deviation of the observation made from a
location further from the surface is higher compared to that made from a
closer position. Consequently, unless the weighting function accounts for the
relative accuracy of these observations, in the superimposed TSDF function
$F(X, Y, Z)$ of equation (12) the precision and details present in the closer
observation will be lost due to the inordinate influence of the noisy distant
observation. Since from equation (9) we can see that the point (X, Y, Z)
projects to $(x_i, y_i)$ in scan i, the corresponding weight $W_i(X, Y, Z)$
is derived from the uncertainty of the depth value at the pixel $(x_i, y_i)$.

Now, let us consider a set of scalar observations $x_i = \mu + n_i$ where the
noise terms $n_i$ are independently distributed Gaussians, i.e.
$n_i \sim N(0, \sigma_i^2)$ with varying $\sigma_i$. It can be easily seen
that the maximum likelihood estimate (MLE) for $\mu$ is given by
$\mu \propto \sum_i x_i / \sigma_i^2$. In other words, each individual
measurement $x_i$ is weighted by a factor inversely proportional to the
variance of that observation. In our scenario of depth map measurements, the
scalar depth estimate $Z_i(x_i, y_i)$ has standard deviation proportional to
$Z_i(x_i, y_i)^2$. Therefore, the weights we assign are inversely
proportional to the fourth power of depth, i.e.

$$ W_i(X, Y, Z) = W_i(x_i, y_i) = \frac{1}{Z_i^4(x_i, y_i)} \tag{13} $$
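The inverse-variance form of the estimate can be made explicit with the standard derivation below, which assumes only the independent Gaussian model stated in the text; the normalized estimator $\hat{\mu}$ is written out here for clarity.

```latex
\begin{aligned}
-\log L(\mu) &= \sum_i \frac{(x_i - \mu)^2}{2\sigma_i^2} + \text{const} \\
\frac{\partial}{\partial \mu}\bigl(-\log L(\mu)\bigr) = 0
  &\;\Longrightarrow\; \sum_i \frac{x_i - \mu}{\sigma_i^2} = 0 \\
  &\;\Longrightarrow\; \hat{\mu}
    = \frac{\sum_i x_i / \sigma_i^2}{\sum_i 1 / \sigma_i^2}
\end{aligned}
```

Substituting $\sigma_i \propto Z_i^2$ gives weights proportional to $1 / Z_i^4$, and the normalized form has the same structure as the weighted average of equation (12).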


It will be evident from equation (12) that any scalar constant factor has
no influence on the summed TSDF. As a result, we need not incorporate
any normalization constant in equation (13) and can simply equate $W_i$ with
the inverse of the fourth power of depth $Z_i$. While summing up TSDF’s
at a grid point, each TSDF is weighted by the weight assigned to it at
that point. Under the assumption of orthographic projection and using the
fact that the range errors are independently distributed along the sensor's
line-of-sight, it can be shown that the volumetric scan merging method is
optimal in a least squares sense [?]. Hence, with these assumptions, our
weighting scheme gives a maximum likelihood estimate of the surface when
we model the depth estimates to have Gaussian noise with standard deviation
quadratically varying with the distance of an object.
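The effect of the weights in equations (12) and (13) can be seen in a one-dimensional toy example. The sketch below is only illustrative: the helper names `fuse_tsdf` and `zero_crossing`, the sample spacing, the truncation value and the 9 mm disagreement between the two scans are hypothetical, and each scan is given a single weight because only one line of sight is simulated.

```python
import numpy as np

def fuse_tsdf(tsdfs, depths, weighted=True):
    """Weighted average of per-scan TSDF samples, eq. (12), with W_i = Z_i^-4."""
    tsdfs = np.asarray(tsdfs, dtype=float)      # shape (n_scans, n_samples)
    depths = np.asarray(depths, dtype=float)    # observed depth Z_i of each scan
    w = depths ** -4.0 if weighted else np.ones_like(depths)
    return (w[:, None] * tsdfs).sum(axis=0) / w.sum()

def zero_crossing(s, F):
    """Position where F changes from negative to non-negative (linear interpolation)."""
    idx = np.flatnonzero((F[:-1] < 0) & (F[1:] >= 0))[0]
    t = F[idx] / (F[idx] - F[idx + 1])
    return float(s[idx] + t * (s[idx + 1] - s[idx]))

# Samples along one line of sight, in metres relative to the true surface.
s = np.linspace(-0.05, 0.05, 101)
trunc = 0.03
# Near scan (Z = 0.75 m) places the surface at 0 mm; the noisier far scan
# (Z = 1.5 m) places it at 9 mm. Both offsets are made up for this example.
f_near = np.clip(s - 0.000, -trunc, trunc)
f_far = np.clip(s - 0.009, -trunc, trunc)

surface = {}
for weighted in (False, True):
    F = fuse_tsdf([f_near, f_far], [0.75, 1.5], weighted=weighted)
    surface[weighted] = zero_crossing(s, F)
    print("weighted" if weighted else "unweighted", surface[weighted])
```

Since the far scan is taken from twice the distance, its weight is smaller by a factor of $2^4 = 16$. The equal-weight merge therefore places the surface halfway between the two observations (4.5 mm), whereas the weighted merge moves it only $9/17 \approx 0.53$ mm away from the near observation.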

We demonstrate the effectiveness of the weighting scheme of equation (13)
with the following experiment. We scanned a life-size statue of Mahatma
Gandhi located on the grounds of Sabarmati Ashram in Ahmedabad from
varying distances. In Figure 5(b) we show a scan taken from close range, i.e.
an average distance of approximately 75 cm, whereas in Figure 5(a) we show
a scan of the same statue taken from a greater distance of about 1.5 m. While
the area of the statue covered by the more distant scan is larger than the
nearer one, for simplicity of visualization we crop the depth maps to show the
same common parts of the statue. Figure 5(c) is the result of the standard
volumetric merging of these two scans (i.e. with weights $W_i = 1$) and Figure
5(d) is the result of volumetric merging using our weighting scheme. It is
evident from a comparison of Figure 5(c) and Figure 5(d) that our weighting
scheme provides a much better result than standard volumetric merging.
This is because the scan in Figure 5(b) is taken from a closer distance and
has more details than the distant scan of Figure 5(a). If they are averaged
with equal weights, then all the details of the closer scan are sacrificed. In
contrast, our weighting scheme preserves these details as it naturally assigns
greater weightage to the observations taken from the closer scan.

Figure 5: Scan of Mahatma Gandhi’s statue taken from long (a) and short (b)
distances. Results of merging (a) and (b) with unweighted and our weighted
volumetric method are depicted in (c) and (d) respectively.

In Figure 6 we show the complete 3D reconstruction of the life-size statue
of Mahatma Gandhi. This complete 3D model is built with 29 scans taken by
walking around the statue. As this statue is located outdoors we acquired
the scans at night since commercially available IR depth cameras do not
work in the presence of moderate to strong sunlight. As mentioned earlier,
this model is scanned from widely varying distances. While the scans taken
from a distance cover a larger portion of the statue and hence help in
accurate registration, scans taken from a closer position have much better
detail. Closer scans have a greater amount of detail since the uncertainty in
depth estimates is lower and at the same time a small surface area is imaged
over a larger number of depth map pixels (compared to a long range scan),
thereby providing a larger number of detailed surface measurements.
Therefore, for successful 3D modeling, we need to allow for both types of
scans. Nevertheless, as seen earlier, while merging scans taken from
different distances, it is essential to weight them according to their
inherent accuracy to achieve the best possible 3D reconstruction. As can be
seen from Figure 6(b-c), this goal is successfully attained with our weighted
volumetric merging scheme. In contrast, we can see from Figure 6(a) that the
standard unweighted volumetric scan merging does poorly in comparison as it
does not assign a lower level of influence to scans taken from a greater
distance.

In Figure 7 we show a cross section of the TSDF functions generated
with the ordinary unweighted approach and our weighted averaging scheme
for this example. It is clear that our approach produces a function which
better preserves details compared to the ordinary averaging approach. For
instance, in Figure 7 the outline of the lips is seen to be more prominent in
our approach. This observation is further illustrated in Figure 7(c),
which compares the zero crossing curves extracted from both the TSDF’s.
Here it is easily seen that our weighting is able to preserve details better
than the unweighted volumetric scan merging approach.

Figure 6: Our reconstruction of a life-size statue of Mahatma Gandhi.
(a) Unweighted VCG reconstruction. (b) Our weighted VCG reconstruction.
(c) Different views of our weighted VCG reconstruction.

Figure 7: Comparison of TSDF’s for ordinary volumetric merging (left) with
our weighted volumetric merging (middle). The extracted zero crossings are
also shown (right). Zero crossings for unweighted and weighted TSDF’s are
shown in red and blue respectively.

Figure 8: (a) shows the reconstructed 3D model of a bust of the scientist C.
V. Raman rendered from different viewpoints; (b) shows the reconstruction
obtained using Kinect Fusion.

As another example of the applicability of both our depth map denoising
and weighted volumetric scan merging, in Figure 8 we show the results
of our reconstruction of a metal bust of height of about 50 cm of the Indian 
scientist, C. V. Raman. The reconstructed model is shown from different 
viewpoints in Figure 8(a). We first filtered the individual depth maps using
the adaptive bilateral filter detailed earlier in Section 3.2. These smoothed
depth maps are converted into 3D scans and registered globally in a single
frame of reference using the motion averaged ICP (MAICP) algorithm
presented in [?]. It is worth reiterating here that the adaptive bilateral
filtering step is essential to make the subsequent registration steps work as
the raw depth maps from the Kinect are very noisy. Without this filtering
step, the normals estimated on the scans would be error prone and result in
registration failures. Once the registration of all scans is achieved, we use
our modified volumetric scan merging approach detailed in Section 3.3 to
generate a final 3D model. Figure 8(b) is the corresponding 3D reconstruction
obtained using Kinect Fusion [2]. As can be seen, the quality of 3D
reconstruction using our approach is comparable to that of Kinect Fusion,
which needs to capture depth maps at video rate and needs expensive hardware
to process them. Also, Kinect Fusion requires the depth camera to be moved
very slowly and carefully during the scanning process. In this experiment,
Kinect Fusion was run for more than 5 minutes at a frame rate of around 3-4
fps, i.e. more than 1000 frames were captured and processed in the
reconstruction. In contrast, our result is obtained using as few as 21 depth
maps. Yet our result is comparable to that of Kinect Fusion because we take
care of the uncertainties in individual measurements in an optimal way.


3.4 Plane Extraction 

The ease of use of depth cameras like Kinect has also made them suitable for
use in mobile robotics applications such as simultaneous localization and
mapping (SLAM) [?], [13], navigation etc. In this subsection, we will focus
our attention on a specific subproblem in a typical SLAM pipeline, i.e.
extracting 3D planes from noisy range images. 3D planes are ubiquitous
in indoor scenes and can be utilized in a variety of ways for aiding motion
estimation and tracking [?]. Some approaches to plane extraction from
noisy range images are presented in [37], [?], [?]. [?] describes a method
for plane extraction on the range maps obtained from time-of-flight scanners.
This method finds candidate planes by means of an expensive region-growing
algorithm and the sequential nature of range videos is exploited to achieve
speed up. [?] proposes a fast algorithm for plane extraction using RGB-D
cameras like Kinect, wherein surface normals are computed using integral
images and clustered to find planar regions. But in [?] the intrinsic noise
characteristics of structured-light stereo depth cameras are not exploited.
[36] describes plane extraction from the range map where the noise is a
quadratic function of both the distance and the incidence angle.

As in many pattern recognition applications, given a hypothesis of a 3D
plane, we can propose a test for whether a given 3D point belongs to this 
plane. Typically this would involve measuring the distance of the point to the 
plane and comparing it with a fixed threshold. The threshold is determined 
by the expected level of noise in the observed 3D point location. However, 
in the case of depth map observations each 3D point has a different amount 
of noise associated with it, implying that we cannot use a fixed threshold 
in our plane fitting hypothesis test. It is evident that points that are at 
a greater distance need a more relaxed distance threshold in contrast with 
points that are closer to the depth camera. While such a varying threshold 
can be incorporated into the test for plane fitting, it will be recognized that 
we need to hypothesize the parameters of the 3D plane based on the observed 
data itself. In such a scenario, the non-uniformity of the standard deviation 
of observation noise makes it hard to generate valid 3D plane hypotheses. 
However, we can avoid this difficulty if we use disparity maps instead of 
depth maps. 

Earlier we derived the fact that the standard deviation of depth
measurement varies quadratically with depth. However, this observation
model is itself an outcome of the assumption that the uncertainty (i.e.
standard deviation) in disparity measurement is a constant and independent of
depth or disparity. Therefore, for the purposes of 3D plane extraction it
is particularly advantageous to work with disparity maps instead of depth
maps. This is true since, for a planar surface, the 2D disparity map
$D(x, y)$ has an affine relationship with the pixel location $(x, y)$.
Consider a point $P = (X, Y, Z)$ on a 3D plane satisfying the equation
$aX + bY + cZ + 1 = 0$. If the point (X, Y, Z) gets projected to pixel
location $(x, y)$ in the IR camera, then,

$$ x = \frac{fX}{Z} + u, \qquad y = \frac{fY}{Z} + v \tag{14} $$

where $f$ is the focal length of the camera and $(u, v)$ is the principal
point of the camera. The disparity at the point $(x, y)$ is given by
$D(x, y) = fB/Z$ where $B$ is the baseline distance between the projector and
the camera center. Substituting variables in the equation of the plane we
have,

$$
\begin{aligned}
aX + bY + cZ + 1 &= 0 \\
\Rightarrow\; a(x - u) + b(y - v) + cf + \frac{f}{Z} &= 0 \\
\Rightarrow\; ax + by + \frac{1}{B} D(x, y) + (cf - au - bv) &= 0
\end{aligned}
\tag{15}
$$
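The affine relationship of equation (15) can be checked numerically. The sketch below assumes hypothetical values for the focal length, principal point and baseline (they are not taken from the paper), synthesizes the disparity of a known plane through equation (14) and $D = fB/Z$, and then recovers $(a, b, c)$ from an ordinary least-squares affine fit; the function names are illustrative.

```python
import numpy as np

# Hypothetical intrinsics and baseline, chosen only for illustration.
f, u, v, B = 580.0, 320.0, 240.0, 0.075

def plane_disparity(a, b, c, x, y):
    """Disparity of the plane aX + bY + cZ + 1 = 0 at pixel (x, y), from eq. (15)."""
    return -B * (a * x + b * y + (c * f - a * u - b * v))

def fit_plane_from_disparity(x, y, D):
    """Least-squares affine fit D = p*x + q*y + r, converted back to (a, b, c)."""
    A = np.column_stack([x, y, np.ones_like(x)])
    (p, q, r), *_ = np.linalg.lstsq(A, D, rcond=None)
    a, b = -p / B, -q / B
    c = (-r / B + a * u + b * v) / f
    return a, b, c

# Synthetic check on a coarse pixel grid for a plane roughly 2 m away.
a, b, c = 0.1, -0.05, -0.5
xx, yy = np.meshgrid(np.arange(0.0, 640.0, 40.0), np.arange(0.0, 480.0, 40.0))
x, y = xx.ravel(), yy.ravel()
Z = -f / (a * (x - u) + b * (y - v) + c * f)   # depth of the plane along each pixel ray
D = f * B / Z                                   # disparity computed from depth
print(np.allclose(D, plane_disparity(a, b, c, x, y)))
print(fit_plane_from_disparity(x, y, D))
```

Because the fit is linear in disparity, where the noise is assumed to have constant standard deviation, an unweighted least-squares fit is a reasonable choice here; the same fit performed on depth values would need per-point weights.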

The affine relationship in equation (15) helps in formulating a very
efficient algorithm for plane extraction using a disparity map. The disparity
image is first passed through a Laplacian-of-Gaussian (LoG) filter which is
a smoothed version of the 2D Laplacian operator and has a high response
for sharp changes and a zero response for regions that display a linear
variation. The LoG filter is popularly used for edge or blob detection. Since
the disparity map obeys an affine relationship in planar regions, the ideal
response of a LoG filter for a planar region should be equal to zero.
Conversely, for non-planar regions we can expect to observe a higher response
when a LoG filter is applied. Therefore, in our case, pixels with LoG
response values above a given threshold are considered to belong to a
non-planar region. It will be noted here that this threshold is a fixed
constant, i.e. independent of depth or equivalently disparity. Amongst all
pixels that pass this test, we find connected components to identify planar
regions in the depth maps. For each planar region, the parameters of the
plane are computed robustly using the relationship in equation (15).
Consequently, the use of a fixed threshold on the disparity map helps us in
formulating a very fast technique for plane detection in depth maps.
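The LoG test and the connected component step described above can be sketched as follows. This is a simplified stand-in rather than the authors' code: the LoG is approximated by Gaussian smoothing followed by a 5-point Laplacian, image borders are excluded instead of padded, and the threshold, `sigma` and the synthetic two-plane disparity map are hypothetical.

```python
import numpy as np
from collections import deque

def log_response(D, sigma=1.0):
    """Absolute response of a discrete LoG (Gaussian smoothing + 5-point Laplacian).

    Only pixels where the full kernel fits are evaluated; the border is set
    to +inf so that it is never labelled planar.
    """
    r = int(3 * sigma)
    t = np.arange(-r, r + 1)
    g = np.exp(-t**2 / (2 * sigma**2))
    g /= g.sum()
    # Separable 'valid' convolution along rows, then along columns.
    S = np.apply_along_axis(np.convolve, 1, D, g, mode="valid")
    S = np.apply_along_axis(np.convolve, 0, S, g, mode="valid")
    lap = (S[:-2, 1:-1] + S[2:, 1:-1] + S[1:-1, :-2] + S[1:-1, 2:]
           - 4 * S[1:-1, 1:-1])
    out = np.full(D.shape, np.inf)
    out[r + 1:-(r + 1), r + 1:-(r + 1)] = np.abs(lap)
    return out

def label_planar_regions(planar):
    """4-connected component labelling of a boolean mask by breadth-first search."""
    labels = np.zeros(planar.shape, dtype=int)
    n = 0
    for seed in zip(*np.nonzero(planar)):
        if labels[seed]:
            continue
        n += 1
        labels[seed] = n
        queue = deque([seed])
        while queue:
            i, j = queue.popleft()
            for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if (0 <= ni < planar.shape[0] and 0 <= nj < planar.shape[1]
                        and planar[ni, nj] and not labels[ni, nj]):
                    labels[ni, nj] = n
                    queue.append((ni, nj))
    return labels, n

# Synthetic disparity map: two planes (affine in x and y) meeting at column 40.
yy, xx = np.mgrid[0:60, 0:80].astype(float)
D = np.where(xx < 40, 20 + 0.05 * xx + 0.02 * yy, 35 - 0.03 * xx + 0.01 * yy)
planar = log_response(D) < 1e-3          # fixed threshold in disparity units
labels, n = label_planar_regions(planar)
print("planar regions found:", n)
```

A symmetric, normalized smoothing kernel leaves an affine signal unchanged, so the response inside each planar region is zero up to rounding, while the disparity jump between the two planes produces a band of high responses that separates the regions.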

Once the parameters for a 3D plane are estimated, we find more points
that closely satisfy the relationship in equation (15). The parameters of the
3D planes are re-estimated using all the fitted points and planes which have
very close values of parameters are merged into composite planes. We repeat 
this process of estimating parameters and the set of points that fit the plane 
iteratively akin to that of k-means which is a classic iterative algorithm 
for clustering of points. The k-means approach requires the specification 
of the number of clusters. In our plane extraction algorithm, we estimate 
the number of planes from the number of connected components present 
after LoG filtering and thresholding. In this context, when plane parameters 
are close to each other, we merge them into single planes. If a point fits 
multiple plane hypotheses equally well, we use its neighboring pixels in the 
depth map to disambiguate, i.e. we try and find the largest coherent regions 
corresponding to 3D planes in a depth map. 
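The alternation between assigning points and re-estimating parameters can be sketched as below. This is a minimal illustration under assumptions: planes are represented by the affine disparity coefficients $(p, q, r)$ with $D \approx px + qy + r$, the residual tolerance `tol` is a hypothetical fixed value in disparity units, and the merging of similar planes and the neighborhood-based disambiguation described in the text are omitted.

```python
import numpy as np

def refine_planes(x, y, D, params, tol=0.5, n_iter=10):
    """Alternate between assigning pixels to affine disparity models and refitting.

    x, y, D: 1D float arrays of pixel coordinates and disparities.
    params: (k, 3) initial coefficients (p, q, r) per plane.
    Returns the refined coefficients and per-pixel labels (-1 = no plane).
    """
    A = np.column_stack([x, y, np.ones_like(x)])
    params = np.array(params, dtype=float)

    def assign(current):
        resid = np.abs(A @ current.T - D[:, None])   # shape (n_pixels, k)
        labels = resid.argmin(axis=1)
        labels[resid.min(axis=1) > tol] = -1         # fixed disparity threshold
        return labels

    for _ in range(n_iter):
        labels = assign(params)
        for k in range(len(params)):
            sel = labels == k
            if sel.sum() >= 3:                       # at least 3 points per plane
                params[k], *_ = np.linalg.lstsq(A[sel], D[sel], rcond=None)
    return params, assign(params)

# Hypothetical usage: two noise-free planes and slightly perturbed initial guesses.
yy, xx = np.mgrid[0:60, 0:80].astype(float)
x, y = xx.ravel(), yy.ravel()
truth = np.array([[0.05, 0.02, 20.0], [-0.03, 0.01, 35.0]])
D = np.where(x < 40,
             truth[0, 0] * x + truth[0, 1] * y + truth[0, 2],
             truth[1, 0] * x + truth[1, 1] * y + truth[1, 2])
params, labels = refine_planes(x, y, D, truth + [0.002, -0.002, 0.1])
print(np.round(params, 4))
```

Unlike k-means, the number of models is not chosen by hand here; following the text, it would be taken from the number of connected components that survive the LoG test.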

Before we present results of our plane extraction method, we provide a 
simple experiment that clearly demonstrates the fact that a fixed distance 
approach to plane fitting is inadequate to handle different situations. In 
Figure 9(a) and (b) we show an RGB image and a depth map respectively of a
scene consisting of 3 planes. The labeled planes A, B and C are in increasing
order of distance from the depth camera. While plane C is at a distance of
about 10 feet from the depth camera, planes A and B are located much closer
at a distance of 2 feet. The depth map in Figure 9(b) is histogram equalized
for ease of visualization. Finally, planes A and B are at about 2 feet from
the depth camera, but differ in depth by 1 cm, i.e. differ in the thickness
of the book shown in plane A. In Figure 9(c) the result of our plane
extraction is shown. Clearly, all three planes are correctly and distinctly
identified.

Figure 9: (a) Color image of a scene consisting of three planes. (b) Depth
map of the three planes (histogram equalized for ease of visualization). (c)
Result of our plane extraction method. Three extracted planes are marked
with three colors. Gray regions do not belong to any of the planes. (d)
Result of using a fixed threshold based plane fitting. This clearly shows that
neither a small nor a large threshold is adequate for all three planes. Please
see the text for details.

In Figure 9(d) we demonstrate why a fixed distance threshold based approach
is not adequate. To obtain the best possible parameter estimates
for the 3 planes we manually segment the three planes in the depth map
and separately compute the parameters of the three planes using principal
component analysis on the corresponding 3D points for each plane.
Subsequently, we specify a threshold T which determines whether a 3D point
belongs to a given plane. If a 3D point is within distance T from a given
plane, we may classify it as belonging to that plane. For our test we used
two different values of T, i.e. T = 5 mm and T = 20 mm. From Figure 9(d)
we see that when T = 5 mm, there is a good fit for most points in planes
A and B with their respective plane models. However, most points on plane 
C do not satisfy this test. Since plane C is further from the camera, the 
noise in individual depth points is large, hence for a small threshold T, most 
points on plane C would be declared to have failed to fit the model for plane 
C, i.e. they would fail to be classified as belonging to plane C. However, 
although many points on C are incorrectly classified as not belonging to C, 
for the small threshold T, no point is wrongly classified as belonging to a 
different plane, i.e. points are distinguishable since the small threshold T is 
a stringent test. When we use the larger threshold T = 20 mm, we see in 
Figure 9(d) that we are no longer able to correctly classify points into planes
A and B. In fact, since the classification threshold T is relaxed (i.e. larger),
we can no longer distinguish plane A from B. However, in this case, most
of the points on plane C are correctly classified as they now lie within the
larger distance threshold of T = 20 mm. Hence, from this experiment we 
may conclude that we cannot use either a small or a large fixed threshold 
to correctly classify points on planes under different scenarios. In contrast, 
since we take into account the true uncertainty of depth measurements, our 
approach can be seen to correctly classify the points into three distinct planes 
in Figure 9(c). We now proceed to demonstrate the efficacy of our method
with more results in the next subsection. 
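The behaviour of the two fixed thresholds can be reproduced with a short calculation. From $D = fB/Z$, a constant disparity noise $\sigma_D$ maps to a depth noise of $\sigma_Z = Z^2 \sigma_D / (fB)$ to first order. The sketch below uses hypothetical sensor constants (focal length, baseline and disparity noise are not taken from the paper) and assumes fronto-parallel planes, so that the distance of a point from its plane is governed by its depth noise.

```python
import math

# Hypothetical sensor constants, chosen only for illustration.
FOCAL_PX = 580.0       # focal length f in pixels
BASELINE_M = 0.075     # projector-camera baseline B in metres
SIGMA_DISP_PX = 0.07   # constant disparity noise in pixels
FEET = 0.3048          # metres per foot

def depth_sigma(z):
    """First-order depth noise implied by D = f*B/Z: sigma_Z = Z^2 * sigma_D / (f*B)."""
    return z * z * SIGMA_DISP_PX / (FOCAL_PX * BASELINE_M)

def fraction_within(threshold, sigma):
    """Probability that zero-mean Gaussian noise stays within +/- threshold."""
    return math.erf(threshold / (sigma * math.sqrt(2.0)))

for name, z in (("A/B", 2 * FEET), ("C", 10 * FEET)):
    sigma = depth_sigma(z)
    for T in (0.005, 0.020):
        print(f"plane {name}: sigma_Z = {1000 * sigma:.2f} mm, "
              f"T = {1000 * T:.0f} mm, "
              f"expected inlier fraction = {fraction_within(T, sigma):.2f}")
```

Whatever constants are chosen, the depth noise at 10 feet is $(10/2)^2 = 25$ times that at 2 feet under this model. A threshold tight enough to separate planes A and B, which are only 1 cm apart, therefore rejects many points on plane C, while a threshold loose enough for plane C exceeds the separation between A and B.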


3.5 Plane Extraction Results 


In this subsection, we demonstrate some results of our disparity map based
plane extraction algorithm. In Figure 10(a) we show our results on two real
world scenes. In Figure 10(b) we show the result of our plane extraction
algorithm applied to more complex scenes from the RGB-D SLAM dataset
[?]. The plane extraction results of our method are depicted by overlaying
color coded regions onto the corresponding RGB images. As can be seen,
our approach effectively captures the significant planar regions present in the
scene and this representation can be used for motion estimation in a SLAM
context [?]. In Figure 11 we show the extracted planes as color
coded regions on the corresponding RGB images for two depth frames of a
SLAM sequence (frames 204 and 207 of the freiburg1_teddy sequence of
the RGB-D SLAM dataset (category 3D object reconstruction)). Using the
analytic models of the matched 3D planes, we estimate the 3D Euclidean
motion between the two depth frames. For ground truth, we apply the
point-to-plane ICP to estimate the Euclidean motion between the two depth
frames. The initial rotation between the two frames is 22.7° whereas our
rotation estimate brings the two frames to within 2.36° of each other. In
other words, our plane extraction method can provide a fast and reliable
estimate of the motion of the depth camera which can be used to initialize
3D motion estimation in a SLAM framework. Apart from significantly speeding
up the 3D motion estimation, by providing a good initial estimate for the
relative motion between the two frames, our method can ensure that we are
within the region of convergence for greedy algorithms such as ICP.

Figure 10: Results of applying our plane extraction method to different depth
maps. Please see this figure in color. (a) Plane extraction on two laboratory
scenes. Left to right: depth map, RGB image and the result of plane
extraction. Black regions are identified as being non-planar. (b) Plane
extraction on instances from the RGB-D SLAM dataset, shown as color coded
regions overlaid on the corresponding RGB images. Regions identified as
non-planar are shown in gray.
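The text does not spell out how the rotation is computed from the matched planes. One common approach, shown here purely as an assumption, is to align the unit normals of the matched planes with the SVD solution of the orthogonal Procrustes (Wahba) problem; the function names and the synthetic normals are hypothetical, and at least two non-parallel planes are required.

```python
import numpy as np

def rotation_from_normals(n_src, n_dst):
    """Least-squares rotation R such that n_dst[i] ~ R @ n_src[i].

    n_src, n_dst: (N, 3) arrays of matched unit plane normals.
    """
    H = n_dst.T @ n_src                   # 3x3 correlation of matched normals
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(U @ Vt))    # guard against a reflection
    return U @ np.diag([1.0, 1.0, d]) @ Vt

def rotation_angle_deg(R):
    """Rotation angle of R in degrees, from trace(R) = 1 + 2*cos(theta)."""
    c = (np.trace(R) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))

# Synthetic check: three mutually orthogonal plane normals rotated by 22.7 degrees.
theta = np.radians(22.7)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
n_src = np.eye(3)                         # e.g. a floor and two walls
n_dst = n_src @ R_true.T                  # each row is R_true @ n_src[i]
R_est = rotation_from_normals(n_src, n_dst)
print(rotation_angle_deg(R_true), rotation_angle_deg(R_est.T @ R_true))
```

The second printed value is the residual rotation remaining after applying the estimate, which is the kind of quantity reported above when the plane-based estimate is compared against the ICP result.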

^ Since the ground truth provided with this dataset is not synchronized in time with the
raw data, we have chosen to use the results of ICP as ground truth for this experiment.

Figure 11: Use of plane extraction for motion estimation. Planes extracted
from two frames of the teddy sequence using our method. Common planes are
then manually matched and used for rotation estimation between these two
frames. Please see text for details.

4 Conclusion 

In this paper we have studied the characteristics of the noise present in 
structured-light stereo based depth cameras. We derive a theoretical model 
to account for the standard deviation of noise present and then provide 
experimental validation of this noise model. In addition, we demonstrate 
the gains to be had by incorporating this noise model into three important 
applications, i.e. depth map denoising, volumetric scan merging and plane 
extraction. Extensive experimental results are presented for each of these 
applications that validate the utility of our noise model.